Systems and methods for performing branch prediction in a variable length instruction set microprocessor

ABSTRACT

A method of performing branch prediction in a microprocessor using variable length instructions is provided. An instruction is fetched from memory based on a specified fetch address and a branch prediction is made based on the address. The prediction is selectively discarded if the look-up was based on a non-sequential fetch to an unaligned instruction address and a branch target alignment cache (BTAC) bit of the instruction is equal to zero. In order to remove the inherent latency of branch prediction, an instruction prior to a branch instruction may be fetched concurrently with a branch prediction unit look-up table entry containing prediction information for a next instruction word. Then, the branch instruction is fetched and a prediction is made on this branch instruction based on information fetched in the previous cycle. The predicted target instruction is fetched on the next clock cycle. If zero overhead loops are used, a look-up table of a branch prediction unit is updated whenever the zero-overhead loop mechanism is updated. A last fetch address of a last instruction of a loop body of a zero overhead loop is stored in the branch prediction look-up table. Then, whenever an instruction fetch hits the end of a loop body, the instruction fetch is predictively re-directed to the start of the loop body. The last fetch address of the loop body is derived from the address of the first instruction after the end of the loop.

CROSS REFERENCE TO RELATED APPLICATION(S)

This application claims priority to provisional application No. 60/572,238 filed May 19, 2004, entitled “Microprocessor Architecture,” hereby incorporated by reference in its entirety.

FIELD OF THE INVENTION

This invention relates generally to microprocessor architecture and more specifically to an improved architecture and mode of operation of a microprocessor for performing branch prediction.

BACKGROUND OF THE INVENTION

A typical component of a multistage microprocessor pipeline is the branch prediction unit (BPU). Usually located in or near a fetch stage of the pipeline, the branch prediction unit increases effective processing speed by predicting whether a branch to a non-sequential instruction will be taken based upon past instruction processing history. The branch prediction unit contains a branch look-up or prediction table that stores the address of branch instructions, an indication as to whether the branch was taken and a speculative target address for a taken branch. When an instruction is fetched, if the instruction is a conditional branch, the result of the conditional branch is speculatively predicted based on past branch history. This speculative or predictive result is injected into the pipeline. Thus, referencing a branch history table, the next instruction is speculatively loaded into the pipeline. Whether or not the prediction is correct will not be known until a later stage of the pipeline. However, if the prediction is correct, clock cycles will be saved by not having to go back to get the next instruction address. Otherwise, the current pipeline behind the stage in which the actual address of the next instruction is determined must be flushed and the correct branch inserted back in the first stage. While this may seem like a harsh penalty for incorrect predictions, in applications where the instruction set is limited and small loops are repeated many times, such as, for example, applications typically implemented with embedded processors, branch prediction is usually accurate enough that the benefits associated with correct predictions outweigh the cost of occasional incorrect predictions, i.e., pipeline flushes. In these types of applications, branch prediction can be accurate more than ninety percent of the time. Thus, the risk of predicting an incorrect branch resulting in a pipeline flush is outweighed by the benefit of saved clock cycles.

While branch prediction is effective at increasing effective processing speed, problems may arise that reduce or eliminate these efficiency gains when dealing with a variable length microprocessor instruction set. For example, if the look-up table is comprised of entries associated with 32-bit wide fetch entities and instructions have lengths varying from 16 to 64 bits, a specific look-up table address entry may not be sufficient to reference a particular instruction.

The description herein of various advantages and disadvantages associated with known apparatus, methods, and materials is not intended to limit the scope of the invention to their exclusion. Indeed, various embodiments of the invention may include one or more of the known apparatus, methods, and materials without suffering from their disadvantages.

As background to the techniques discussed herein, the following references are incorporated herein by reference: U.S. Pat. No. 6,862,563 issued Mar. 1, 2005 entitled “Method And Apparatus For Managing The Configuration And Functionality Of A Semiconductor Design” (Hakewill et al.); U.S. Ser. No. 10/423,745 filed Apr. 25, 2003, entitled “Apparatus and Method for Managing Integrated Circuit Designs”; and U.S. Ser. No. 10/651,560 filed Aug. 29, 2003, entitled “Improved Computerized Extension Apparatus and Methods”, all assigned to the assignee of the present invention.

SUMMARY OF THE INVENTION

Thus, there exists a need for a microprocessor architecture with reduced power consumption, improved performance, a reduced silicon footprint and improved branch prediction as compared with state of the art microprocessors.

In various embodiments of this invention, a microprocessor architecture is disclosed in which branch prediction information is selectively ignored by the instruction pipeline in order to avoid injection of erroneous instructions into the pipeline. These embodiments are particularly useful for branch prediction schemes in which variable length instructions are predictively fetched. In various exemplary embodiments, a 32-bit word is fetched based on the address in the branch prediction table. However, in branch prediction systems based on addresses of 32-bit fetch objects, because the instruction memory is comprised of 32-bit entries, regardless of instruction length, this address may reference a word comprising two 16-bit instruction words, or a 16-bit instruction word and an unaligned instruction word of larger length (32, 48 or 64 bits) or parts of two unaligned instruction words of such larger lengths.

In various embodiments, the branch prediction table may contain a tag coupled to the lower bits of a fetch instruction address. If the entry at the location specified by the branch prediction table contains more than one instruction, for example, two 16-bit instructions, or a 16-bit instruction and a portion of a 32, 48 or 64-bit instruction, a prediction may be made based on an instruction that will ultimately be discarded. Though the instruction aligner will discard the incorrect instruction, a predicted branch will already have been injected into the pipeline and will not be discovered until branch resolution in a later stage of the pipeline, causing a pipeline flush.

Thus, in various exemplary embodiments, to prevent such an incorrect prediction from being made, a prediction will be discarded beforehand if two conditions are satisfied. In various embodiments, a prediction will be discarded if, first, a branch prediction look-up is based on a non-sequential fetch to an unaligned address, and second, the branch target alignment cache (BTAC) bit is equal to zero. This second condition will only be satisfied if the prediction is based on an instruction having an aligned instruction address. In various exemplary embodiments, an alignment bit of zero will indicate that the prediction information is for an aligned branch. This will prevent predictions based on incorrect instructions from being injected into the pipeline.

In various embodiments of this invention, a microprocessor architecture is disclosed which utilizes dynamic branch prediction while removing the inherent latency involved in branch prediction. In this embodiment, an instruction fetch address is used to look up in a BPU table recording historical program flow to predict when a non-sequential program flow is to occur. However, instead of using the instruction address of the branch instruction to index the branch table, the address of the instruction prior to the branch instruction in the program flow is used to index the branch in the branch table. Thus, fetching the instruction prior to the branch instruction will cause a prediction to be made and eliminate the inherent one-step latency in the process of dynamic branch prediction caused by fetching the address of the branch instruction itself. In the above embodiment, it should be noted that in some cases, a delay slot instruction may be inserted after a conditional branch such that the conditional branch is not the last sequential instruction. In such a case, because the delay slot instruction is the actual sequential departure point, the instruction prior to the non-sequential program flow would actually be the branch instruction. Thus, the BPU would index such an entry by the address of the conditional branch instruction itself, since it would be the instruction prior to the non-sequential instruction.

In various embodiments, use of a delay slot instruction will also affect branch resolution in the selection stage. In various exemplary embodiments, if a delay slot instruction is utilized, update of the BPU must be deferred for one execution cycle after the branch instruction. This process is further complicated by the use of variable length instructions. Performance of branch resolution after execution requires updating of the BPU table. However, when the processor instruction set includes variable length instructions it becomes essential to determine the last fetch address of the current instruction as well as the update address, i.e., the fetch address prior to the sequential departure point. In various exemplary embodiments, if the current instruction is an aligned or non-aligned 16-bit or an aligned 32-bit instruction, the last fetch address will be the instruction fetch address of the current instruction. The update address of an aligned 16-bit or aligned 32-bit instruction will be the last fetch address of the prior instruction. For a non-aligned 16-bit instruction, if it was arrived at sequentially, the update address will be the update address of the prior instruction. Otherwise, the update address will be the last fetch address of the prior instruction.

In the same embodiment, if the current instruction is a non-aligned 32-bit or an aligned 48-bit instruction, the last fetch address will simply be the address of the next instruction. The update address will be the current instruction address. The last fetch address of a non-aligned 48-bit instruction or an aligned 64-bit instruction will be the address of the next instruction minus one and the update address will be the current instruction address. If the current instruction is a non-aligned 64-bit instruction, the last fetch address will be the same as the next instruction address and the update address will be the next instruction address minus one.

In exemplary embodiments of this invention, a microprocessor architecture is disclosed which employs dynamic branch prediction and zero overhead loops. In such a processor, the BPU is updated whenever the zero-overhead loop mechanism is updated. Specifically, the BPU needs to store the last fetch address of the last instruction of the loop body. This allows the BPU to predictively re-direct instruction fetch to the start of the loop body whenever an instruction fetch hits the end of the loop body. In this embodiment, the last fetch address of the loop body can be derived from the address of the first instruction after the end of the loop, despite the use of variable length instructions, by exploiting the fact that instructions are fetched in 32-bit word chunks and that instruction sizes are in general integer multiples of 16 bits. Therefore, in this embodiment, if the next instruction after the end of the loop body has an aligned address, the last instruction of the loop body has a last fetch address immediately preceding the address of the next instruction after the end of the loop body. Otherwise, if the next instruction after the end of the loop body has an unaligned address, the last instruction of the loop body has the same fetch address as the next instruction after the loop body.

At least one exemplary embodiment of the invention provides a method of performing branch prediction in a microprocessor using variable length instructions. The method according to this embodiment comprises fetching an instruction from memory based on a specified fetch address, making a branch prediction based on the address of the fetched instruction, and discarding the branch prediction if (1) the branch prediction look-up was based on a non-sequential fetch to an unaligned instruction address and (2) a branch target alignment cache (BTAC) bit of the instruction is equal to zero.

At least one additional exemplary embodiment provides a method of performing dynamic branch prediction in a microprocessor. The method according to this embodiment may comprise, on a first clock cycle, fetching the penultimate instruction word prior to a non-sequential program flow and a branch prediction unit look-up table entry containing prediction information for a next instruction word; on a second clock cycle, fetching the last instruction word prior to the non-sequential program flow and making a prediction on non-sequential program flow based on information fetched in the previous cycle; and, on a third clock cycle, fetching the predicted target instruction.

Yet an additional exemplary embodiment provides a method of updating a look-up table of a branch prediction unit in a variable length instruction set microprocessor. The method may comprise storing a last fetch address of a last instruction of a loop body of a zero overhead loop in the branch prediction look-up table, and predictively re-directing an instruction fetch to the start of the loop body whenever an instruction fetch hits the end of a loop body, wherein the last fetch address of the loop body is derived from the address of the first instruction after the end of the loop.

Other aspects and advantages of the invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram illustrating the contents of a 32-bit instruction memory and a corresponding table illustrating the location of particular instructions within the instruction memory in connection with a technique for selectively ignoring branch prediction information in accordance with at least one exemplary embodiment of this invention;

FIG. 2 is a flow chart illustrating the steps of a method for selectively discarding branch predictions corresponding to aligned 16-bit instructions having the same fetch address as a non-aligned 16-bit target instruction in accordance with at least one exemplary embodiment of this invention;

FIG. 3 is a flow chart illustrating a prior art method of performing branch prediction by storing non-sequential branch instructions in a branch prediction unit table that is indexed by the fetch address of the non-sequential branch instruction;

FIG. 4 is a flow chart illustrating a method for performing branch prediction by storing non-sequential branch instructions in a branch prediction table that is indexed by the fetch address of the instruction prior to the non-sequential branch instruction in accordance with at least one exemplary embodiment of this invention;

FIG. 5 is a diagram illustrating possible scenarios encountered during branch resolution when 32-bit words are fetched from memory in a system incorporating a variable length instruction architecture including instructions of 16 bits, 32 bits, 48 bits or 64 bits in length; and

FIGS. 6 and 7 are tables illustrating a method for computing the last instruction fetch address of a zero-overhead loop for dynamic branch prediction in a variable-length instruction set architecture processor.

DETAILED DESCRIPTION OF THE DISCLOSURE

The following description is intended to convey a thorough understanding of the invention by providing specific embodiments and details involving various aspects of a new and useful microprocessor architecture. It is understood, however, that the invention is not limited to these specific embodiments and details, which are exemplary only. It further is understood that one possessing ordinary skill in the art, in light of known systems and methods, would appreciate the use of the invention for its intended purposes and benefits in any number of alternative embodiments, depending upon specific design and other needs.

FIG. 1 is a diagram illustrating the contents of a 32-bit instruction memory and a corresponding table illustrating the location of particular instructions within the instruction memory in connection with a technique for selectively ignoring branch prediction information in accordance with at least one exemplary embodiment of this invention. When branch prediction is done in a microprocessor employing a variable length instruction set, a performance problem is created when a branch is made to an unaligned target address that is packed with an aligned instruction in the same 32-bit word that is predicted to be a branch.

In FIG. 1, a sequence of 32-bit wide memory words is shown containing instructions instr_1 through instr_4 in sequential locations in memory. Instr_2 is the target of a non-sequential instruction fetch. The BPU stores prediction information in its tables based only on the 32-bit fetch address of the start of the instruction. There can be more than one instruction in any 32-bit word in memory; however, only one prediction can be made per 32-bit word. Thus, the performance problem can be seen by referring to FIG. 1. The instruction address of instr_2 is actually 0x2; however, the fetch address is 0x0, and a fetch of this address will cause the entire 32-bit word comprised of 16 bits of instr_1 and 16 bits of instr_2 to be fetched. Under a simple BPU configuration, a branch prediction will be made for instr_1 based on the instruction fetch of the 32-bit word at address 0x0. The branch predictor does not take into account the fact that the instr_1 at 0x0 will be discarded by the aligner before it can be issued; the prediction nevertheless remains. The prediction would be correct if instr_1 were fetched as the result of a sequential fetch of 0x0, or if a branch were made to 0x0, but in this case, where a branch is made to instr_2 at 0x2, the prediction is wrong. As a result, the prediction is wrong for instr_2, causing an incorrect instruction to hit the backstop and a pipeline flush, a severe performance penalty, to occur.
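The relationship between instruction address and fetch address described above can be sketched as follows. This is a minimal illustration only; the helper names fetch_address and is_aligned are hypothetical and do not appear in the disclosure, and the only addresses used are those given in the text for instr_1 (0x0) and instr_2 (0x2).

```python
WORD_BYTES = 4  # 32-bit fetch words, as in the text


def fetch_address(instr_addr):
    """Byte address of the 32-bit word holding the start of the instruction."""
    return instr_addr & ~(WORD_BYTES - 1)


def is_aligned(instr_addr):
    """True when the instruction starts at the beginning of its fetch word."""
    return instr_addr == fetch_address(instr_addr)


# instr_1 and instr_2 share one fetch word, so a BPU keyed only by
# fetch address cannot tell them apart.
for name, addr in (("instr_1", 0x0), ("instr_2", 0x2)):
    print(name, hex(addr), hex(fetch_address(addr)), is_aligned(addr))
```

Both instructions map to fetch address 0x0, which is why a prediction stored for instr_1 is also returned when the fetch was really aimed at instr_2.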

FIG. 2 is a flow chart outlining the steps of a method for solving the aforementioned problem by selectively discarding branch prediction information in accordance with various embodiments of the invention. Operation of the method begins at step 200 and proceeds to step 205, where a 32-bit word is read from memory at the specified fetch address of the target instruction. Next, in step 210, a prediction is made based on this fetched instruction. This prediction is based on the aligned instruction fetch location. Operation of the method then proceeds to step 215, where the first part of a two-part test is applied to determine whether the branch prediction look-up is based on a non-sequential fetch to an unaligned instruction address. In the context of FIG. 1, this condition would be satisfied by instr_2 because it is non-aligned (it does not start at the beginning of line 0x0, but rather after the first 16 bits). However, this condition alone is not sufficient because a valid branch prediction look-up can be based on a branch located at an unaligned instruction address. For example, in FIG. 1, instr_1 might not be a branch while instr_2 is a branch. If, in step 215, it is determined that the branch prediction look-up is based on a non-sequential fetch to an unaligned instruction address, operation of the method proceeds to the next step of the test, step 220. Otherwise, operation of the method jumps to step 225, where the prediction is assumed valid and passed.

Returning to step 220, in this step a second determination is made as to whether the branch target address cache (BTAC) alignment bit is 0, indicating that the prediction information is for an aligned branch. This bit will be 0 for all aligned branches and will be 1 for all unaligned branches because it is derived from the instruction address. The second least significant bit of the instruction address (bit 1) will always be 0 for aligned branches (i.e., addresses ending in 0, 4, 8, c, etc.) and will always be 1 for unaligned branches (i.e., addresses ending in 2, 6, a, etc.). If, in step 220, it is determined that the BTAC alignment bit is not 0, operation proceeds to step 225, where the prediction is passed. Otherwise, if in step 220 it is determined that the BTAC alignment bit is 0, operation of the method proceeds to step 230, where the prediction is discarded. Thus, rather than causing an incorrect instruction to be injected into the pipeline, which would ultimately cause a pipeline flush, the next sequential instruction will be correctly fetched. After step 230, operation of the method is the same as after step 225: the next fetch address is updated in step 235 based on whether a branch was predicted, and operation returns to step 205, where the next fetch occurs.
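The two-part test of steps 215 and 220 can be sketched as a small predicate. This is an illustrative sketch under stated assumptions, not the disclosed hardware: the function names alignment_bit and discard_prediction are hypothetical, addresses are byte addresses, and the BTAC alignment bit is modeled as bit 1 of the address of the branch that created the table entry.

```python
def alignment_bit(instr_addr):
    """Bit 1 of a byte address: 0 for aligned, 1 for unaligned instructions."""
    return (instr_addr >> 1) & 1


def discard_prediction(sequential_fetch, target_addr, btac_alignment_bit):
    """Return True when the prediction should be dropped (step 230).

    Step 215: the look-up came from a non-sequential fetch to an
    unaligned instruction address.
    Step 220: the stored alignment bit is 0, so the entry describes an
    aligned branch that the aligner will throw away.
    """
    unaligned_jump = (not sequential_fetch) and alignment_bit(target_addr) == 1
    return unaligned_jump and btac_alignment_bit == 0


# Branch to instr_2 at 0x2 while the table entry belongs to instr_1 at 0x0.
print(discard_prediction(False, 0x2, alignment_bit(0x0)))  # True
# Same jump, but the entry belongs to a branch located at 0x2 itself.
print(discard_prediction(False, 0x2, alignment_bit(0x2)))  # False
```

A sequential fetch of the word at 0x0, or a branch to 0x0, never satisfies the first condition, so in those cases the prediction is passed on unchanged.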

As discussed above, dynamic branch prediction is an effective technique to reduce branch penalty in a pipelined processor architecture. This technique uses the instruction fetch address to look up internal tables recording program flow history in order to predict the target of a non-sequential program flow. Also as discussed above, branch prediction is complicated when a variable-length instruction architecture is used. In a variable-length instruction architecture, the instruction fetch address cannot be assumed to be identical to the actual instruction address. This makes it difficult for the branch prediction algorithm to guarantee that sufficient instruction words are fetched while at the same time minimizing unnecessary fetches.

One known method of ameliorating this problem is to add extra pipeline stages to the front of the processor pipeline to perform branch prediction prior to the instruction fetch, allowing more time for the prediction mechanism to make a better decision. A negative consequence of this approach is that extra pipeline stages increase the penalty to correct an incorrect prediction. Alternatively, the extra pipeline stages would not be needed if prediction could be performed concurrently with instruction fetch. However, such a design has an inherent latency in which extra instructions are already fetched by the time a prediction is made.

Traditional branch prediction schemes use the instruction address of a branch instruction (non-sequential program instruction) to index their internal tables. FIG. 3 illustrates such a conventional indexing method in which two instructions are sequentially fetched, the first instruction being a branch instruction, and the second being the next sequential instruction word. In step 300, the branch instruction is fetched with the associated BPU table entry. In the next clock cycle, in step 305, this instruction is propagated in the pipeline to the next stage, where it is detected as a predicted branch while the next instruction is fetched. Then, at step 310, in the next clock cycle, the target instruction is fetched based on the branch prediction made in the last cycle. Thus, a latency is introduced because three steps are required to fetch the branch instruction, make a prediction and fetch the target instruction. If the instruction word fetched in step 305 is part of neither the branch nor its delay slot, then the word is discarded and, as a result, a “bubble” is injected into the pipeline.

FIG. 4 illustrates a novel and improved method for making a branch prediction in accordance with various embodiments of the invention. The method depicted in FIG. 4 is characterized in that the instruction address of the instruction preceding the branch instruction is used to index the BPU table rather than the instruction address of the branch instruction itself. As a result, by fetching the instruction just prior to the branch instruction, a prediction can be made from the address of this instruction while the branch instruction itself is being fetched.

Referring specifically to FIG. 4, the method begins in step 400, where the instruction prior to the branch instruction is fetched together with the BPU entry containing prediction information of the next instruction. Next, in step 405, the branch instruction is fetched while, concurrently, a prediction on this branch can be made based on information fetched in the previous cycle. Then, in step 410, in the next clock cycle, the target instruction is fetched. As illustrated, no extra instruction word is fetched between the branch and the target instructions. Hence, no bubble will be injected into the pipeline and overall performance of the processor is improved.
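The difference between the two indexing schemes can be illustrated with a toy fetch model. This is a simplified sketch under assumptions that are not part of the disclosure: every instruction occupies exactly one fetch word, fetch addresses are word indices, a table hit on the word fetched in one cycle can only redirect the fetch two cycles later, and the function name fetch_stream and the example addresses (10, 11, 40) are hypothetical.

```python
def fetch_stream(start, table, cycles):
    """Return the word addresses fetched over `cycles` clock cycles.

    `table` maps a looked-up fetch address to a predicted target address.
    The look-up issued alongside a fetch only becomes usable one cycle
    later, so it steers the fetch after the one already in flight.
    """
    fetched = []
    pending = None  # prediction read from the table in the previous cycle
    addr = start
    for _ in range(cycles):
        fetched.append(addr)
        redirect = pending          # result of last cycle's look-up
        pending = table.get(addr)   # look-up issued with this fetch
        addr = redirect if redirect is not None else addr + 1
    return fetched


BRANCH, TARGET = 11, 40

# Entry indexed by the branch's own address: word 12 is fetched needlessly.
print(fetch_stream(10, {BRANCH: TARGET}, 4))      # [10, 11, 12, 40]
# Entry indexed by the address of the instruction before the branch.
print(fetch_stream(10, {BRANCH - 1: TARGET}, 4))  # [10, 11, 40, 41]
```

In the first stream the word at address 12 corresponds to the bubble; in the second stream the target follows the branch directly.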

It should be noted that in some cases, due to the use of delay slot instructions, the branch instruction may not be the departure point (the instruction prior to non-sequential flow). Rather, another instruction may appear after the branch instruction. Therefore, though the non-sequential jump is dictated by the branch instruction, the last instruction to be executed may not be the branch instruction, but may rather be the delay slot instruction. A delay slot is used in some processor architectures with short pipelines to hide branch resolution latency. Processors with dynamic branch prediction might still have to support the concept of delay slots to be compatible with legacy code. Where a delay slot instruction is used after the branch instruction, utilizing the above branch prediction scheme will cause the instruction address of the branch instruction, not that of the instruction before the branch instruction, to be used to index the BPU tables, because the branch instruction is actually the instruction before the last instruction. This fact has significant consequences for branch resolution, as will be discussed below. Namely, in order to effectively perform branch resolution, the last fetch address of the previous instruction must be known.

As stated above, branch resolution occurs in the selection stage of the pipeline and causes the BPU to be updated to reflect the outcome of the conditional branch during the write-back stage. FIG. 5 illustrates five potential scenarios encountered when performing branch resolution. These scenarios may be grouped into two groups by the way in which they are handled. Group one comprises two scenarios: a non-aligned 16-bit instruction, and an aligned 16 or 32-bit instruction. Group two comprises three scenarios: a non-aligned 32-bit or an aligned 48-bit instruction, a non-aligned 48-bit or an aligned 64-bit instruction, and a non-aligned 64-bit instruction.

Two pieces of information need to be computed for every instruction under this scheme: namely, the last fetch address of the current instruction, L₀, and the update address of the current instruction, U₀. In the case of the scenarios of group one, it is also necessary to know L⁻¹, the last fetch address of the previous instruction, and U⁻¹, the update address of the previous instruction. Looking at the first and second scenarios, a non-aligned 16-bit instruction and an aligned 16 or 32-bit instruction respectively, L₀ is simply the 30 most significant bits of the fetch address, denoted as inst_addr[31:2]. However, because in both of these scenarios the instruction spans only one fetch address line, the update address U₀ depends on whether these instructions were arrived at sequentially or as the result of a non-sequential instruction. In keeping with the method discussed in the context of FIG. 4, the last fetch address of the prior instruction, also known as L⁻¹, is known. This information is stored internally and is available as a variable to the current instruction in the select stage of the pipeline. In the first scenario, if the current instruction is arrived at through sequential program flow, it has the same departure address as the prior instruction and hence U₀ will be U⁻¹. Otherwise, the update address will be the last fetch address of the prior non-sequential instruction, L⁻¹. In the second scenario, the update address of a 16 or 32-bit aligned instruction, U₀, will be the last fetch address of the prior instruction, L⁻¹, irrespective of whether the prior instruction was sequential or not.

Scenarios 3-5 can be handled in the same manner by taking advantage of the fact that each instruction fetch fetches a contiguous 32-bit word. Therefore, when the instruction is sufficiently long and/or unaligned to span two or more consecutive fetched instruction words in memory, we know with certainty that L₀, the last fetch address, can be derived from the instruction address of the next sequential instruction, denoted as next_addr[31:2] in FIG. 5. In scenarios 3 and 5, covering non-aligned 32-bit, aligned 48-bit and non-aligned 64-bit instructions, the last portion of the current instruction shares the same fetch address with the start of the next sequential instruction. Hence L₀ will be next_addr[31:2]. In scenario 4, covering non-aligned 48-bit or aligned 64-bit instructions, the fetch address of the last portion of the current instruction is one less than the start address of the next sequential instruction. Hence, L₀=next_addr[31:2]−1. On the other hand, in scenarios 3 and 4, the current instruction spans two consecutive 32-bit fetched instruction words. The fetch address prior to the last portion of the current instruction is always the fetch address of the start of the instruction. Therefore, U₀ will be inst_addr[31:2]. In scenario 5, the last portion of the current instruction shares the same fetch address as the start of the next sequential instruction. Hence, U₀ will be next_addr[31:2]−1. In the scheme just described, the update address U₀ and last fetch address L₀ are computed based on four values that are provided to the selection stage as early arriving signals directly from registers. These signals are namely inst_addr, next_addr, L⁻¹ and U⁻¹. Only one multiplexer is required to compute U₀ in scenario 1, and one decrementer is required to compute L₀ in scenario 4 and U₀ in scenario 5. The overall complexity of the novel and improved branch prediction method being disclosed is only marginally increased compared with traditional methods.
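The five scenarios can be collected into one function for illustration. This is a software sketch under stated assumptions rather than the disclosed logic: the function name last_fetch_and_update and its parameters are hypothetical, addresses are byte addresses, the returned values are word indices corresponding to the [31:2] bit fields, and l_prev and u_prev stand for the last fetch address and update address of the previous instruction.

```python
def last_fetch_and_update(inst_addr, size_bits, l_prev, u_prev, sequential):
    """Return (L0, U0) as 32-bit word indices for the current instruction."""
    next_addr = inst_addr + size_bits // 8
    inst_word = inst_addr >> 2      # inst_addr[31:2]
    next_word = next_addr >> 2      # next_addr[31:2]
    aligned = (inst_addr & 2) == 0

    if not aligned and size_bits == 16:                      # scenario 1
        return inst_word, (u_prev if sequential else l_prev)
    if aligned and size_bits in (16, 32):                    # scenario 2
        return inst_word, l_prev
    if (not aligned and size_bits == 32) or (aligned and size_bits == 48):
        return next_word, inst_word                          # scenario 3
    if (not aligned and size_bits == 48) or (aligned and size_bits == 64):
        return next_word - 1, inst_word                      # scenario 4
    if not aligned and size_bits == 64:                      # scenario 5
        return next_word, next_word - 1
    raise ValueError("unsupported instruction size or alignment")


# A non-aligned 64-bit instruction at byte address 0xA covers bytes
# 0xA..0x11, i.e. words 2, 3 and 4.
print(last_fetch_and_update(0xA, 64, l_prev=2, u_prev=1, sequential=True))
# (4, 3)
```

One way to sanity-check the scenario table is to compare L₀ with the word index of the instruction's last byte, computed directly as (inst_addr + size_bytes − 1) >> 2; the two agree for every size and alignment listed in the text.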

In yet another embodiment of the invention, a method and apparatus are provided for computing the last instruction fetch address of a zero-overhead loop for dynamic branch prediction in a variable length instruction set microprocessor. Zero-overhead loops and the previously discussed dynamic branch prediction are both powerful techniques for improving effective processor performance. In a microprocessor employing both techniques, the BPU has to be updated whenever the zero-overhead loop mechanism is updated. In particular, the BPU needs the last instruction fetch address of the loop body. This allows the BPU to re-direct instruction fetch to the start of the loop body whenever an instruction fetch hits the end of the loop body. However, in a variable-length instruction architecture, determining the last fetch address of a loop body is not trivial. Typically, a processor with a variable-length instruction set only keeps track of the first address an instruction is fetched from. However, the last fetch address of a loop body is the fetch address of the last portion of the last instruction of the loop body and is not readily available.

Typically, a zero-overhead loop mechanism requires an address related to the end of the loop body to be stored as part of the architectural state. In various exemplary embodiments, this address can be denoted as LP_END. If LP_END is assigned the address of the next instruction after the last instruction of the loop body, the last fetch address of the loop body, designated in various exemplary embodiments as LP_LAST, can be derived by exploiting two facts. Firstly, despite the variable length nature of the instruction set, instructions are fetched in fixed size chunks, namely 32-bit words. The BPU works only with the fetch addresses of these fixed size chunks. Secondly, the sizes of variable-length instructions are usually an integer multiple of a fixed size, namely 16 bits. Based on these facts, an instruction can be classified as aligned if the start address of the instruction is the same as the fetch address. If LP_END is an aligned address, LP_LAST must be the fetch address that precedes that of LP_END. If LP_END is non-aligned, LP_LAST is the fetch address of LP_END. Thus, the equation LP_LAST = LP_END[31:2] − (~LP_END[1]) can be used to derive LP_LAST whether or not LP_END is aligned.
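The equation above can be expressed as a short software sketch. The function name and the 30-bit wrap-around mask are assumptions made for illustration; the result is a word index corresponding to address bits [31:2], so shifting it left by two gives the byte address of the fetched word.

```python
def lp_last(lp_end: int) -> int:
    """Return LP_LAST as a word index (address bits [31:2]).

    Hypothetical helper implementing LP_LAST = LP_END[31:2] - (~LP_END[1]).
    """
    word = (lp_end >> 2) & 0x3FFFFFFF   # LP_END[31:2]
    bit1 = (lp_end >> 1) & 1            # LP_END[1]: 1 if LP_END is non-aligned
    inverted = bit1 ^ 1                 # ~LP_END[1], as a single bit
    # Assumed behaviour: the subtraction wraps within 30 bits, as a
    # fixed-width hardware decrementer would.
    return (word - inverted) & 0x3FFFFFFF
```

For a non-aligned LP_END of 0xA the function returns the index of the word at 0x8, and for an aligned LP_END of 0x18 it returns the index of the word at 0x14.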

Referring to FIGS. 6 and 7, two examples are illustrated in which LP_END is non-aligned and aligned, respectively. In both cases, the instruction “sub” is the last instruction of the loop body. In the first case, LP_END is located at 0xA. In this case, LP_END is unaligned and LP_END[1] is 1; thus, the inversion of LP_END[1] is 0 and the last fetch address of the loop body, LP_LAST, is LP_END[31:2], which is 0x8. In the second case, LP_END is aligned and located at 0x18. LP_END[1] is 0, as with all aligned instructions; thus, the inversion of LP_END[1] is 1 and LP_LAST is LP_END[31:2]−1, or the line above LP_END, line 0x14. Note that in the above calculations, least significant bits of addresses that are known to be zero are ignored for the sake of simplifying the description.

While the foregoing description includes many details and specificities, it is to be understood that these have been included for purposes of explanation only, and are not to be interpreted as limitations of the present invention. Many modifications to the embodiments described above can be made without departing from the spirit and scope of the invention.

1. In a microprocessor, a method of performing branch prediction using variable length instructions, the method comprising: fetching an instruction from memory based on a specified fetch address; making a branch prediction based on the address of the fetched instruction; and discarding the branch prediction if: (1) a branch prediction look-up was based on a non-sequential fetch to an unaligned instruction address; and (2) a branch target alignment cache (BTAC) bit of the instruction is equal to a predefined value.
2. The method according to claim 1, further comprising passing a predicted instruction associated with the branch prediction if either (1) or (2) is false.
3. The method according to claim 2, further comprising updating a next fetch address if a branch prediction is incorrect.
4. The method according to claim 3, wherein the microprocessor comprises an instruction pipeline having a select stage, and updating comprises, after resolving a branch in the select stage, updating the branch prediction unit (BPU) with the address of the next instruction resulting from that branch.
5. The method according to claim 1, wherein making a branch prediction comprises parsing a branch look-up table of a branch prediction unit (BPU) that indexes non-sequential branch instructions by their addresses in association with the next instruction taken.
6. The method according to claim 1, wherein an instruction is determined to be unaligned if it does not start at the beginning of a memory address line.
7. The method according to claim 1, wherein a BTAC alignment bit will be one of a 0 or a 1 for an aligned branch instruction and the other of a 0 or a 1 for an unaligned branch instruction.
8. In a microprocessor, a method of performing dynamic branch prediction comprising: fetching the penultimate instruction word prior to a non-sequential program flow and a branch prediction unit look-up table entry containing prediction information for a next instruction on a first clock cycle; fetching the last instruction word prior to a non-sequential program flow and making a prediction on this non-sequential program flow based on information fetched in the previous cycle on a second clock cycle; and fetching the predicted target instruction on a third clock cycle.
9. The method according to claim 8, wherein fetching an instruction prior to a branch instruction and a branch prediction look-up table entry comprises using the instruction address of the instruction just prior to the branch instruction in the program flow to index the branch in the branch table.
10. The method according to claim 9, wherein if a delay slot instruction appears after the branch instruction, fetching an instruction prior to a branch instruction and a branch prediction look-up table entry comprises using the instruction address of the branch instruction, not the instruction before the branch instruction, to index the BPU tables.
11. A method of updating a look-up table of a branch prediction unit in a variable length instruction set microprocessor, the method comprising: storing a last fetch address of a last instruction of a loop body of a zero overhead loop in the branch prediction look-up table; and predictively re-directing an instruction fetch to the start of the loop body whenever an instruction fetch hits the end of a loop body, wherein the last fetch address of the loop body is derived from the address of the first instruction after the end of the loop.
12. The method according to claim 11, wherein storing comprises, if the next instruction after the end of the loop body has an aligned address, the last instruction of the loop body has a last fetch address immediately preceding the address of the next instruction after the end of the loop body, otherwise, if the next instruction after the end of the loop body has an unaligned address, the last instruction of the loop body has a last fetch address the same as the address of the next instruction after the loop body.